relative feedback
CueLearner: Bootstrapping and local policy adaptation from relative feedback
Schiavi, Giulio, Cramariuc, Andrei, Ott, Lionel, Siegwart, Roland
Human guidance has emerged as a powerful tool for enhancing reinforcement learning (RL). However, conventional forms of guidance such as demonstrations or binary scalar feedback can be challenging to collect or have low information content, motivating the exploration of other forms of human input. Among these, relative feedback (i.e., feedback on how to improve an action, such as "more to the left") offers a good balance between usability and information richness. Previous research has shown that relative feedback can be used to enhance policy search methods. However, these efforts have been limited to specific policy classes and use feedback inefficiently. In this work, we introduce a novel method to learn from relative feedback and combine it with off-policy reinforcement learning. Through evaluations on two sparse-reward tasks, we demonstrate our method can be used to improve the sample efficiency of reinforcement learning by guiding its exploration process. Additionally, we show it can adapt a policy to changes in the environment or the user's preferences. Finally, we demonstrate real-world applicability by employing our approach to learn a navigation policy in a sparse reward setting.
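As a rough illustration of what relative feedback such as "more to the left" means for an action, the sketch below shifts an action a fixed step along a cue direction. It is not the CueLearner method itself: the 2-D action space, the `CUE_DIRECTIONS` mapping, the step size, and the function name `apply_relative_feedback` are all assumptions made for this example.

```python
import numpy as np

# Hypothetical mapping from verbal cues to unit directions in a 2-D action space.
CUE_DIRECTIONS = {
    "left": np.array([-1.0, 0.0]),
    "right": np.array([1.0, 0.0]),
    "up": np.array([0.0, 1.0]),
    "down": np.array([0.0, -1.0]),
}


def apply_relative_feedback(action, cue, step=0.1, low=-1.0, high=1.0):
    """Shift an action a fixed step along the cue direction, then clip to bounds."""
    corrected = np.asarray(action, dtype=float) + step * CUE_DIRECTIONS[cue]
    return np.clip(corrected, low, high)


# The user says "more to the left" about the action (0.2, 0.5).
print(apply_relative_feedback([0.2, 0.5], "left"))
```

A corrected action of this kind could then serve as an exploration target or as a training signal; how the feedback is actually combined with off-policy RL is described in the paper, not here.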
Fusing Reward and Dueling Feedback in Stochastic Bandits
Wang, Xuchuang, Zeng, Qirun, Zuo, Jinhang, Liu, Xutong, Hajiesmaili, Mohammad, Lui, John C. S., Wierman, Adam
This paper investigates the fusion of absolute (reward) and relative (dueling) feedback in stochastic bandits, where both feedback types are gathered in each decision round. We derive a regret lower bound, demonstrating that an efficient algorithm may incur only the smaller among the reward and dueling-based regret for each individual arm. We propose two fusion approaches: (1) a simple elimination fusion algorithm that leverages both feedback types to explore all arms and unifies collected information by sharing a common candidate arm set, and (2) a decomposition fusion algorithm that selects the more effective feedback to explore the corresponding arms and randomly assigns one feedback type for exploration and the other for exploitation in each round. The elimination fusion experiences a suboptimal multiplicative term of the number of arms in regret due to the intrinsic suboptimality of dueling elimination. In contrast, the decomposition fusion achieves regret matching the lower bound up to a constant under a common assumption. Extensive experiments confirm the efficacy of our algorithms and theoretical results.
Conversational Dueling Bandits in Generalized Linear Models
Yang, Shuhua, Yuan, Hui, Zhang, Xiaoying, Wang, Mengdi, Zhang, Hong, Wang, Huazheng
Conversational recommendation systems elicit user preferences by interacting with users to obtain their feedback on recommended commodities. Such systems utilize a multi-armed bandit framework to learn user preferences in an online manner and have achieved great success in recent years. However, existing conversational bandit methods have several limitations. First, they only enable users to provide explicit binary feedback on the recommended items or categories, leading to ambiguity in interpretation. In practice, users are usually faced with more than one choice. Relative feedback, known for its informativeness, has gained increasing popularity in recommendation system design. Moreover, current contextual bandit methods mainly work under linear reward assumptions, ignoring practical non-linear reward structures in generalized linear models. Therefore, in this paper, we introduce relative-feedback-based conversations into conversational recommendation systems through the integration of dueling bandits in generalized linear models (GLM) and propose a novel conversational dueling bandit algorithm called ConDuel. Theoretical analyses of regret upper bounds and empirical validations on synthetic and real-world data underscore ConDuel's efficacy. We also demonstrate the potential to extend our algorithm to multinomial logit bandits with theoretical and experimental guarantees, which further proves the applicability of the proposed framework.
Comparison-based Conversational Recommender System with Relative Bandit Feedback
Xie, Zhihui, Yu, Tong, Zhao, Canzhe, Li, Shuai
With recent advances in conversational recommendation, recommender systems can actively and dynamically elicit user preferences through conversational interactions. To achieve this, the system periodically queries users' preferences on attributes and collects their feedback. However, most existing conversational recommender systems only let the user provide absolute feedback on the attributes. In practice, absolute feedback is usually limited, as users tend to give biased feedback when expressing their preferences. Users are often more inclined to express comparative preferences, since user preferences are inherently relative. To enable users to provide comparative preferences during conversational interactions, we propose a novel comparison-based conversational recommender system. Relative feedback, though more practical, is not easy to incorporate, since its scale is always mismatched with users' absolute preferences. To collect and interpret relative feedback effectively in an interactive manner, we further propose a new bandit algorithm, which we call RelativeConUCB. Experiments on both synthetic and real-world datasets validate the advantage of our proposed method over existing bandit algorithms for conversational recommender systems.
Modeling User Preferences Using Relative Feedback for Personalized Recommendations
Kalloori, Saikishore (Swiss Federal Institute of Technology in Zurich) | Li, Tianyu (Rakuten Institute of Technology)
Recommender systems are widely developed to learn user preferences from their past history and make predictions on the unseen items a user may like. User preferences in the form of absolute preferences, such as user ratings or clicks, are commonly used to model a user’s interest and generate recommendations. However, rating items is not the most natural mechanism that users use for making decisions in daily life. For instance, we do not rate t-shirts when we want to buy one. It is more likely that we will compare them one against another and purchase the preferred one. In this work, we focus on relative feedback, which generates pairwise preferences as an alternative way to model user preferences and compute recommendations. In our scenario, each user is shown a set of item pairs and asked to indicate which item in each pair they prefer. We propose a recommendation algorithm to predict a user’s relative preference for a given pair of items and compute a personalised ranking of items. We demonstrate the effectiveness of our proposed algorithm in comparison with state-of-the-art relative-feedback-based recommendation approaches. Our experimental results reveal that the proposed algorithm is able to outperform the baseline algorithms on popular ranking-oriented evaluation metrics.
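The step from pairwise comparisons to a ranking can be sketched with a standard Bradley-Terry model, fitted here with the classic minorization-maximization update. This is a swapped-in textbook technique, not the algorithm proposed in the paper; the function names, the `prior` smoothing term, and the toy comparison data are assumptions made for this example.

```python
import numpy as np


def fit_bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Estimate item strengths from (winner, loser) pairs for a single user.

    `prior` adds a small pseudo-count to every ordered pair so that no
    strength collapses to zero (an assumption of this sketch).
    """
    wins = np.full((n_items, n_items), prior)
    np.fill_diagonal(wins, 0.0)
    for winner, loser in comparisons:
        wins[winner, loser] += 1.0

    pair_counts = wins + wins.T      # comparisons made between each pair
    total_wins = wins.sum(axis=1)    # wins per item
    strengths = np.ones(n_items) / n_items

    for _ in range(iters):
        # MM update: wins divided by sum_j n_ij / (p_i + p_j)
        denom = (pair_counts / (strengths[:, None] + strengths[None, :])).sum(axis=1)
        strengths = total_wins / denom
        strengths /= strengths.sum()
    return strengths


def preference_probability(strengths, i, j):
    """Modelled probability that item i is preferred over item j."""
    return strengths[i] / (strengths[i] + strengths[j])


# Toy data: item 0 beats item 1 twice and item 2 once; item 1 beats item 2 twice.
comparisons = [(0, 1), (0, 2), (1, 2), (0, 1), (1, 2)]
strengths = fit_bradley_terry(3, comparisons)
ranking = np.argsort(-strengths)  # most preferred item first
print(ranking, preference_probability(strengths, 0, 2))
```

A personalised recommender would fit one such model per user, or share parameters across users; the sketch covers only the single-user case.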